Antipattern: Pattern Matching Predicates
Let’s look at how we can use LIKE and REGEXP predicates for searching from a large collection.
SQL provides pattern-matching predicates for comparing strings, and this is the first solution most programmers use when searching for keywords.
Using pattern matching LIKE
predicates#
The most widely supported of these is the LIKE
predicate.
The LIKE
predicate supports a wildcard (%
) that matches zero or more characters. Using this wildcard before and after a keyword matches any string that contains that word. The first wildcard matches any text preceding the word, and the second wildcard matches any text following the word.
SELECT * FROM Bugs WHERE description LIKE '%crash%';
Using regular expressions for pattern matching#
Regular expressions are also supported by many database brands, although not in a standard way. You don’t need wildcards, because conventionally regular expressions match the pattern against any substring anyway. Here’s an example using MySQL’s regular expression predicate:
SELECT * FROM Bugs WHERE description REGEXP 'crash';
Limitations of using pattern matching#
The most important disadvantage of pattern matching operators is that they have poor performance. They can’t benefit from a conventional index, so they must scan every row in a table. Since matching a pattern against a string column is an expensive operation (relative to, for instance, comparing two integers for equality), the total cost of a table scan for this search is very high.
The second problem of simple pattern matching using LIKE
or regular expressions is that it can find un intended matches.
SELECT * FROM Bugs WHERE description LIKE '%one%';
The previous example matches text that contains the word “one”, but it also matches the strings “money”, “prone”, “lonely”, and so on. Searching for a pattern with the keyword delimited by spaces doesn’t match occurrences of the word with punctuation or at the start or end of the text. The regular expressions supported by our database might support a special pattern for a word boundary to solve this issue:
SELECT * FROM Bugs WHERE description REGEXP '[[:<:]]one[[:>:]]';
The query is executed successfully, and there are no results returned. It means that the expression did not return false matches.
Given the problems of performance and scalability and the gymnastics we have to perform to prevent irrelevant matches, simple pattern matching is a poor technique for searching for keywords.